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Abstract 

Parallel  solution  of  1-indexed  recurrence  relations  has  received  much  attention,  but  no  effort 
has  been  made  to  design  parallel  algorithms  for  multi-indexed  recurrence  relations,  where  the 
recurrence  is  in  more  than  one  index.  Multi-indexed  relations  arise  in  JPEG-standard  DPCM 
compression  of  images,  which  is  an  increasingly  important  area  due  to  the  explosion  of  imagery 
applications  such  as  multimedia  and  medical  imaging.  In  this  paper  we  design  and  analyze  novel 
parallel  algorithms  for  solving  multi-indexed  recurrence  relations  of  arbitrary  order.  To  that 
end,  we  develop  three  techniques:  index  sequencing , index  decoupling , and  dimension  shifting. 
Index  sequencing  leads  to  a parallel  algorithm  of  0(n  log  n)  time  on  n processors,  where  the 
input  size  is  n2.  Index  decoupling  applies  in  limited  situations,  one  of  which  is  a commonly 
used  special  case  of  DPCM,  and  leads  to  a parallel  algorithm  of  O(logn)  time  for  n x n images 
on  n x n processors.  Dimension  shifting  is  a universal  technique  for  multi-indexed  relations, 
and  involves  reducing  the  number  of  indices  to  1 while  clustering  the  terms  of  the  relation  into 
vectors.  This  is  accomplished  by  means  of  special  vector  operators  that  we  define  and  study. 
Our  parallel  algorithm  for  doing  the  dimension  shifting  and  for  solving  the  resulting  1-indexed 
recurrence  relation  takes  0(log2  N ) time  on  meshes  of  partitionable  buses  or  hypercubes,  where 
N is  the  data  input-output  size. 

Keywords:  Compression,  DPCM,  Parallel  Algorithms,  Recurrence  Relations,  Dimension  Shift- 
ing, Convolution,  Fourier  Transform. 
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1 Introduction 


The  massive  amounts  of  imagery  data  in  many  applications  demand  that  images  be  compressed 
to  reduce  their  storage  and  transmission  requirements.  In  certain  applications,  most  notably 
medical  imaging,  the  compression  has  to  be  lossless.  One  of  the  well-known  lossless  image 
compression  techniques  is  differential  pulse-code  modulation  (DPCM),  which  was  adopted  by 
the  Joint  Photographic  Expert  Group  (JPEG)  as  an  international  standard  [12]. 

Whenever  an  image  is  retrieved,  locally  or  remotely,  it  has  to  be  decompressed  online  for 
display.  Therefore,  decompression,  also  called  decoding,  must  be  as  fast  as  possible.  Compres- 
sion, also  called  coding,  is  desired  to  be  fast  too,  but  its  speed  is  not  as  critical  as  the  decoding 
speed  because  coding  is  typically  performed  just  once,  often  offline,  whereas  decoding  is  done 
many  times,  often  online  or  in  real  time.  Consequently,  it  is  especially  important  to  develop 
fast  decoding  algorithms. 

DPCM  decoding  is  essentially  the  computation  of  a 2-indexed  scalar  recurrence  relation 
of  low  order;  every  term  in  the  recurrence  relation  is  indexed  with  two  indices,  the  row  and 
column  positions  of  pixels.  Similarly,  DPCM  of  3D  and  4D  images,  which  is  conceivable  in 
applications  involving  3D  visual  modeling  with  or  without  motion,  leads  to  3-indexed  and 
4-indexed  recurrence  relations.  Much  work  has  been  done  on  parallel  algorithms  for  solving 

1- indexed  recurrence  relations  of  arbitrary  order  [2,  6,  7,  13],  due  to  their  applicability  in 
numerical  computations  [16,  17,  18];  the  fastest  parallel  algorithms  for  solving  those  equations 
take  logarithmic  time,  and  are  based  on  parallel  prefix  computation  [1,  4,  8,  9,  10,  11,  13,  15]. 
However,  no  work  has  been  reported  on  multi-indexed  recurrence  relations.  Considering  the 
importance  of  DPCM  coding/decoding  of  imagery,  and  the  need  for  fast  decoding,  parallel 
algorithms  for  solving  multi-indexed  recurrence  relations  merit  serious  study. 

In  this  paper  we  design  and  analyze  parallel  algorithms  for  solving  multi-indexed  recurrence 
relations  of  arbitrary  order,  and  identify  the  architectures  best  suited  for  the  algorithms.  For 
clarification,  the  multi-indexed  relations  are  addressed  in  increasing  order  of  complexity.  Two- 
indexed  recurrence  relations  of  order  1 are  addressed  first,  2-indexed  recurrence  relations  of 
arbitrary  order  are  addressed  second,  and  multi-indexed  relations  of  arbitrary  order,  where  the 
number  of  indices  is  greater  than  2,  are  addressed  last, 

We  develop  three  approaches  for  solving  multi-indexed  recurrence  relations:  index  sequenc- 
ing, index  decoupling , and  dimension  shifting.  The  index  sequencing  approach  for  solving  a 

2- indexed  relation  to  compute  n x n terms,  as  in  DPCM  decoding  of  an  n x n image2,  breaks 
down  the  relation  into  a sequence  of  n 1-indexed  scalar  recurrence  relations  that  must  be  solved 
one  after  another.  Each  relation  can  be  solved  by  a parallel  O (log  72)  time  algorithm  on  an 
n-processor  hypercube  or  partitionable  bus,  and  thus  the  n equations  take  0(n\ogn)  time  on 
n processors.  This  approach  does  not,  however,  exploit  the  full  potential  of  parallelism. 

Index  decoupling,  when  applicable,  breaks  the  2-indexed  relation  into  many  independent 
1-indexed  recurrence  relations,  which  can  then  be  solved  simultaneously,  leading  to  an  overall 
logarithmic  time  parallel  algorithm.  We  will  apply  this  approach  to  one  special  but  commonly 
used  DPCM.  However,  since  index  decoupling  does  not  apply  universally  to  arbitrary  multi- 

2 All  the  algorithms  in  the  paper  apply  equally  well  on  n x m rectangular  images,  but  for  ease  of  presentation 
we  will  consider  only  n x n square  images  in  the  2-indexed  relations  case. 
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indexed  recurrence  relations,  alternative  approaches  must  be  sought. 

Dimension  shifting  is  a universal  approach.  To  understand  it,  one  must  note  the  three 
dimensionalities  in  a multi-indexed  recurrence  relation:  the  term  dimensionality , the  equation 
order , and  the  index  dimensionality.  The  first  refers  to  the  dimension  of  every  term  in  the 
recurrence  relation,  and  the  second  indicates  the  number  of  previous  terms  each  term  depends 
on.  The  index  dimensionality  is  the  number  of  indices  q per  term.  Dimension  shifting  involves 
reducing  the  index  dimension  of  ^-indexed  relations  down  to  1 while  increasing  the  term  dimen- 
sion by  a certain  factor,  then  reducing  the  order  dimension  down  to  1 while  increasing  the  term 
dimension  by  another  appropriate  factor.  This  yields  a 1-indexed  vector  recurrence  relation  of 
order  one,  where  the  vectors  are  (q  — l)-dimensional  arrays.  The  operations  involved  in  the 
resulting  recurrence  relation  are  multidimensional-vector  convolution  and  addition.  Since  mul- 
tidimensional convolution  can  be  done  in  parallel  in  logarithmic  time  using  multidimensional 
FFT  [3,  5,  20],  and  since  standard  parallel  algorithms  for  1-indexed  recurrence  relations  of  order 
1 take  a logarithmic  number  of  operations  [2,  6,  7,  13],  the  total  time  of  the  algorithm  is  square- 
logarithmic,  using  N processors  configured  as  a ^-dimensional  mesh  of  partitionable  buses  or 
hypercubes,  where  N is  the  output  size.  Note  that  converting  a ^-indexed  recurrence  relation 
to  a 1-indexed  relation  involves  solving,  recursively,  a (q  — l)-indexed  recurrence  relation. 

Two  points  are  noteworthy.  First,  dimension  shifting  has  been  applied  to  1-indexed  multi- 
order  recurrence  relations  to  reduce  the  order  down  to  1 and  increase  the  term  dimension  [7]; 
however,  dimension  shifting  of  the  index  dimension  to  term  dimension  is  novel,  and  involves 
some  intricate  operators  and  operator  manipulation.  Second,  dimension  shifting  could  be  done 
more  easily  by  shifting  the  index  dimension  to  the  order  dimension  first.  For  example,  the 
2-indexed  terms  of  an  n x n image  can  be  re-indexed  with  one  index  by  ordering  the  pixels  in 
some  manner  such  as  row-wise  or  column-wise.  This  re-indexing  turns  the  recurrence  relation 
into  a 1-indexed  relation  without  changing  the  dimension  of  the  terms  (e.g.,  pixels),  but  it 
increases  the  order  of  the  relation  dramatically.  Reducing  the  order  afterwords  requires  an 
n-fold  increase  in  the  term  dimension,  and  the  corresponding  parallel  algorithm  to  solve  the 
resulting  equation  takes  0(n  log  n)  time  on  an  n x n processor  system,  which  is  considerably 
longer  than  the  square-logarithmic  time  mentioned  above.  Therefore,  our  dimension  shifting 
approach  outlined  in  the  previous  paragraph  is  superior. 

The  paper  is  organized  as  follows.  The  next  section  overviews  DPCM  and  formulates  the 
DPCM  decoding  problem  as  a multi-indexed  scalar  recurrence  relation.  Sections  3 and  4 de- 
velop the  index-sequencing  and  index-decoupling  approaches,  respectively.  Section  5 presents 
a vector-operator  based  formulation  of  2-indexed  recurrence  relations  of  order  1,  defining  the 
operators  and  proving  some  of  their  relevant  properties.  In  section  6,  parallel  algorithms  for 
setting  up  and  solving  the  resulting  1-indexed  vector  recurrence  relations  are  developed  and 
analyzed.  Section  7 addresses  arbitrary-order  2-indexed  recurrence  relations,  and  develops  a 
parallel  algorithm  for  them.  Section  8 will  address  the  most  general  case,  namely,  multi-indexed 
recurrence  relations  of  arbitrary  order,  where  the  number  of  indices  is  greater  than  2.  Finally, 
section  9 closes  the  paper  with  conclusions  and  future  directions. 
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2 DP  CM  and  Multi-indexed  Recurrence  Relations 


The  most  widely  used  form  of  DPCM  is  overviewed  first,  and  formulated  into  a 2-indexed  re- 
currence relation  of  order  1.  Afterwards,  higher  order  and  higher  dimension  DPCM  is  presented 
and  formulated  into  a multi-indexed  recurrence  relation  of  the  same  order. 

DPCM  assumes  that  the  interpixel  redundancies  and  correlations  can  be  modeled  as  a 2D 
Markov  model  of  some  order  [5].  For  order  1,  the  model  involves  three  parameter,  a,  b and  c, 
as  described  next.  Let  X(0  : n — 1, 0 : n — 1)  be  a matrix  of  pixels,  representing  an  image.  The 
model  assumes  that  every  pixel  X(i,  j)  can  be  largely  “predicted”  from  its  three  neighbors  in 
the  north,  west,  and  north-west  as  follows: 


X(iJ)  = aX{iJ  - l)  + bX(i  - 1,  j)  + cX(i  - 1J  - 1)  + E{iJ) 


(1) 


where  E{i,j ) is  the  'prediction  error , or  residual.  Note  that  when  i and  j are  outside  their 
appropriate  range,  the  corresponding  value  for  X(i,j)  is  assumed  to  be  zero.  Some  of  the 
commonly  used  values  for  (a,  d,  c)  are  (1,1,-1),  (1/2, 1/2,0),  and  (3/4, 3/4, -1/2).  The  (1,1,-1) 
is  often  used,  partly  because  of  its  arithmetic  simplicity  (no  division  or  multiplication).  This 
special  case  will  be  treated  separately  using  the  index  decoupling  approach. 

DPCM  coding  involves  first  computing  the  residuals  E{i,j)  using  the  following  equation 
derived  from  the  model  of  equation  (1): 


E{i,j)  = X(i,j)  - (aX(iJ  - l)+bX(z  - 1 ,j)  + cX(i  - 1 ,j  - 1)) 


(2) 


and  then  encode  the  residuals  into  a bit  stream,  using  an  entropy  coder  such  as  Huffman  coding, 
run-length  encoding,  or  arithmetic  coding,  which  are  lossless  [14].  The  residuals  tend  to  be  of 
small  values  and  of  high  frequencies  of  occurrence,  enabling  the  entropy  coder  to  reduce  the 
data  size,  typically  by  a factor  of  2,  3 or  4. 

The  computation  of  E is  easily  parallelizable  on  an  n x n mesh,  assuming  the  whole  matrix 
X is  available  in  the  processors  ( X(i,j ) in  processor  P(i,  j)).  Each  P(i,j)  sends  its  data  X{i,  j) 
to  its  three  neighbors  to  the  south,  east,  and  south-east.  Afterwards,  each  processor  (i,  j)  has 
all  the  data  it  needs  to  compute  E(i,j ) according  to  equation  (2).  The  communication  and  the 
parallel  computation  take  constant  time.  Of  course  the  entropy  coding  of  E will  have  to  be 
parallelized,  but  this  is  left  out  of  our  scope  because  it  calls  for  different  approaches. 

DPCM  decoding  decodes  the  bit  stream  back  to  the  residuals  E(i,j),  and  then  computes 
the  pixel  values  X(i,j ) according  to  equation  (1).  Clearly,  equation  (1)  is  a 2-indexed  scalar 
recurrence  relation  of  order  one,  which  will  be  solved  in  parallel  in  Sections  3 and  4. 

The  generalization  of  equation  (1)  comes  from  higher  order  DPCM  and  from  multidimen- 
sional imagery.  A DPCM  of  order  (k,  t ) and  parameter  array  a(0  : k,0  : t)  assumes  a 2D 
Markov  model  of  images  where  every  pixel  can  be  largely  predicted  from  its  north-west  rectan- 
gular neighborhood  that  extends  k steps  to  the  north  and  t steps  to  the  west.  The  governing 
DPCM  equation,  which  is  a generalization  of  equation  1 is  the  following  2-indexed  recurrence 
relation  of  order  ( k,t ): 


k t 


X(i,j)  = ’EH “(i. s)X(i  - U -s)  + E(i,j) 


(3) 


1=0  s=0 
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where  a( 0,0)  is  equal  to  0 so  that  X(i,j)  does  not  depend  on  itself;  the  term  a(0,0)  is  kept  to 
simplify  the  expression  of  the  double  summation  in  equation  (3).  Here  again,  when  the  indices 
step  out  of  their  appropriate  bounds,  the  corresponding  terms  are  assumed  to  be  zero. 

Finally,  DPCM  of  order  (ti,t2,...  ,tq)  and  parameter  array  a(0  : *i,0  : t2,...,0  : tq),  for  q- 
dimensional  imagery  data,  involves  a ^-indexed  recurrence  relation  of  order  ....  tq)  in  the 

form: 

t\  <2  *9 

X(ii,i2,.:,iq)  = zL  - 51  a{T\ir2i-irq)X(ii-rl,i2-r2,...,iq-Tq)+E(il,i2,....iq),  (4) 

ri=0r2=0  rq= 0 

where  a(0,0,  ...,0)  is  equal  to  0 so  that  X(ii,  z2, ...,  iq)  does  not  depend  on  itself.  As  before, 
any  term  is  assumed  to  be  zero  if  its  indices  are  out  of  their  defined  bounds.  Note  that  when 
tl=t2  = ...  = tq  = t,  the  recurrence  relation  is  said  to  be  of  order  t. 

In  this  paper  we  will  develop  parallel  algorithms  for  solving  equations  (1),  (3)  and  (4),  in 
that  order  of  increasing  difficulty.  We  will  start  with  the  index  sequencing  and  index  decoupling 
techniques  first,  due  to  their  simplicity,  and  then  move  on  to  the  general  dimension  shifting 
technique. 


3 Index  sequencing 

Index  sequencing  applies  to  equation  (1)  to  compute  X from  E as  follows.  For  each  fixed  row  z, 
in  the  sequential  order  i = 0, 1, ...,  n — 1,  equation  (1)  is  a 1-indexed  recurrence  relation  of  order 
1,  where  the  index  is  j,  and  the  last  three  terms,  i.e.,  X(i  — 1,  j),  (X(z  — 1 ,j  — 1)  and  E(i,j ), 
are  known  and  available.  Here  is  the  algorithm,  assuming  an  n-processor  system  configured 
as  a hypercube  or  a partitionable  bus.  Processor  j hosts  column  j of  E,  and  is  responsible 
for  computing  column  j of  X.  Also,  a,  b and  c are  assumed  to  be  stored  in  all  the  processors. 


Algorithm  Index-sequencing(input:  X,  a,  b,  c;  output:  X) 
begin 

1.  for  i = 0 to  n — 1 do 

2.  Each  processor  j — 1 sends  its  X(i  — 1,  j — 1)  to  processor  j; 

3.  Each  processors  j computes  dj  = bX(i  — 1 ,j)  + cX(i  — 1,  j - 1)  + E(i,  j); 

4.  All  the  processors  solve  in  parallel  the  1-indexed  recurrence  relation  of  order  1 

X(iJ)  = aX  (z,  j - 1)  +dj, 

where  the  recurrence  index  is  j , thus  computing  row  z of  X; 

endfor 

end 

Analysis 

Step  2 takes  0(1)  time  on  a partitionable  bus,  or  O(logrz)  time  on  a hypercube,  because 
the  step  is  a translation  (i.e.,  shift)  communication  step.  Step  3 takes  0(1)  time.  Finally, 
step  4 takes  O(logn)  time  using  any  standard  parallel  algorithm  for  solving  scalar  1-indexed 
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recurrence  relations  on  n processors  [2,  6,  7,  13].  Consequently,  step  1,  and  hence  the  whole 
algorithm,  takes  O(nlogn)  time.  The  corresponding  speedup  is  0(n^gn)  = and  the 

efficiency  is  0(1/ log n). 

Clearly,  the  above  algorithm  has  the  advantages  of  good  speedup,  good  efficiency,  and  rela- 
tively small  number  of  processors.  However,  it  does  not  exploit  the  full  potential  for  parallelism, 
and  thus  other  techniques  are  called  for. 


4 Index  Decoupling:  A Special  Case  of  DPCM 


As  mentioned  earlier,  among  the  most  common  values  for  the  parameter  triplet  (a,  6,  c)  is 
(1, 1,  — 1).  In  this  section  we  will  treat  this  special  case  due  to  its  popularity.  Index  decoupling 
will  apply  to  any  situation  where  a = 1 and  c — —b. 

Under  the  assumption  that  a = 1 and  c — —6,  equation  I becomes: 

x(i,  j)  = X(Lj  - 1)  + b(X(i  - 1, j)  -X(i-l,j-  1))  + 

which  is  equivalent  to 


X(i,j)  - X(i,j  - 1)  = b (X(i  - 1 ,j)  - X(i  - 1,  j - 1))  + E(i,j).  (5) 


By  defining 


Y(i,j)  = X(i,j)-X(i,j-  1), 


equation  (5)  becomes: 


Y{iJ)  =bY(i-  1 ,j)  + E{i,j). 


(6) 

(7) 


Observe  that  equation  (7)  consists  of  n independent  equations,  one  per  column  j , where  each 
equation  is  a scalar  1-indexed  recurrence  relation  of  order  1,  and  the  recurrence  index  is  i.  The 
initial  value  — 0 for  all  j.  Due  to  independence,  those  n recurrence  relations  can 

be  solved  simultaneously,  where  each  equation  can  be  solved  by  a parallel  recurrence  relation 
algorithm  on  a separate  subsystem. 


Once  Y is  computed,  X can  be  computed  using  a simple  relation  derived  from  equation  (6): 


X(i,j)  = X(i,j-l)  + Y(i,j).  (8) 

Here  again,  equation  (8)  consists  of  n independent  equations,  one  per  row  i , where  each  equa- 
tions is  a scalar  1-indexed  recurrence  relation  of  order  1,  and  the  recurrence  index  is  j.  The 
initial  value  X(i , —1)  = 0 for  all  i.  Also,  due  to  independence,  those  n recurrence  relations  can 
be  solved  simultaneously,  where  each  equation  can  be  solved  by  a parallel  prefix  algorithm  on 
a separate  subsystem.  Thus,  equations  (7)  and  (8)  form  the  essence  of  the  index  decoupling 
approach,  since  they  decouple  equation  (1)  into  two  separate  sets  of  1-indexed  relations.  The 
algorithm  for  solving  equations  (7)  and  (8)  is  given  next,  assuming  an  architecture  of  2D  mesh 
where  every  row  and  every  column  is  a hypercube  or  a partitionable  bus,  because  a hypercube 
or  a partitionable  bus  is  an  ideal  architecture  for  solving  1-indexed  recurrence  relations  [19,  20]. 
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Algorithm  Index-decoupling(input:  E;  output:  A') 
begin 

1.  for  j=0  to  n — 1 pardo 

2.  solve  the  relation  Y (i,j)  = bY{i  — 1,  j)  + E(i:j)  on  a separate  mesh  column  ; 

endfor 

3.  for  2—0  to  n — 1 pardo 

4.  call  parallel  prefix  on  X(i,j)  = X{i,j  — 1)  -f  Y{i,  j)  to  run  on  a separate  mesh  row; 

endfor 

end 


Analysis 

First  assume  an  n x n mesh  of  hypercubes  or  partitionable  buses.  For  each  j , step  2 runs 
on  column  j of  the  mesh,  and  takes  O(logn)  time  because  it  is  a 1-indexed  scalar  recurrence 
relation  or  order  1.  Therefore,  the  whole  for-loop  in  step  1 takes  O(logn).  Similarly,  for  each 

2,  step  4 runs  on  row  i of  the  mesh  and  takes  O(logn)  time  for  the  same  reason.  Accordingly, 
the  whole  for-loop  in  step  3 takes  O(logn)  time.  Therefore,  the  whole  algorithm  takes  O(logn) 
time,  which  is  faster  than  the  previous  algorithm  by  a logarithmic  factor.  The  speedup  over 
the  sequential  algorithm  is  clearly  O(j^),  and  the  efficiency  is 

If  the  system  is  a pxp  mesh  of  hypercubes  or  partitionable  buses,  where  p < n,  then  divide 
each  of  the  matrices  A,  Y,  and  E into  pxp  square  blocks  of  size  | x ^ each,  and  let  the  (2,  j) 
block  reside  in  processor  (i,j).  Since  each  1-indexed  recurrence  relation  takes  0(n/p  + log p) 
time  on  p processors,  and  since  each  row  or  column  of  processors  has  to  solve  n/p  recurrence 
relations,  it  follows  that  each  for-loop,  and  thus  the  whole  algorithm,  takes  0((n/p  + log p)n/p) 
time.  If  plogp  < n,  then  the  algorithm  takes  0{n* 2/p 2)  time,  resulting  in  0(p2)  speedup  and 
0(1)  efficiency,  which  are  optimal  up  to  a constant  factor. 

Clearly,  index  dimension  yields  a time-optimal  logarithmic  parallel  algorithm  for  the  popular 
special  case  of  DPCM  where  (a,  6,  c)  = (1,  —1, 1).  However,  it  does  not  apply  to  the  general 
case  of  arbitrary  values  of  (a,  6,  c)  or  to  multi-indexed  relations  of  arbitrary  order. 

The  remainder  of  the  paper  will  develop  the  universal  dimension-shifting  approach  to  derive 
square-logarithmic-time  parallel  algorithms  for  arbitrary  multi-indexed  recurrence  relations  of 
arbitrary  order,  starting  with  2-indexed  relations  of  order  1,  and  then  generalizing  to  2-indexed 
relations  of  arbitrary  order  and  finally  to  multi-indexed  relations. 
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5 Dimension  Shifting:  Conversion  to  1-indexed  Vector 
Recurrence  Relations 


One  fundamental  idea  in  dimension  shifting  is  the  use  of  the  shift  operator  S on  column  vectors. 
If  V(0  : n — 1)  is  a column  vector,  then  SV  is  another  vector  W(0  : n — 1)  derived  from  V by 
shifting  it  down  one  position:  for  all  i > 0,  W(i)  = V(i  — 1),  and  1T(0)  = 0.  The  operator 
S can  be  applied  repeatedly  (say,  r times)  on  a vector.  This  is  equivalent  to  composition  of 
operators.  Denote  by  ST  the  resulting  operator.  Clearly,  <SrI/  is  a vector  derived  from  V by 
shifting  the  latter  down  r times,  that  is,  for  all  i = 0, 1, ...,  n — 1,  (<SrU)(z)  = V(i  — r),  where 
a term  is  0 when  its  index  is  out  of  bounds.  Two  special  cases  of  r are  worth  noting:  r = 0 
and  r = n.  S°  is  by  convention  the  identity  operator  X,  that  is,  XV  = V for  any  vector  V. 
For  convenience,  we  will  sometimes  use  1 instead  X.  When  r = n,  observe  that  SnV  = 0,  the 
zero  vector  of  length  n.  Accordingly,  we  simply  denote  the  operator  Sn  by  0.  Note  that  for  all 
r > n,  ST  = 0.  Polynomial  operators  in  <S,  that  is,  any  linear  combination  of  the  powers  of  <S, 
can  be  easily  defined  as  follows. 

Definition  1 Given  a real  sequence  a0,  a1? ...,  an_i,  a polynomial  operator  T = YaTo 1 ai$l  zs 
defined  to  be  such  that  for  any  vector  V,  TV  = YaTq  ai(S'V).  The  vector  T{ 0 : n — 1),  where 
T(i)  = a*  for  all  i,  is  called  the  characteristic  vector  of  operator  T. 

Throughout  the  paper,  the  following  notational  convention  will  be  followed.  For  any  matrix 
M(0  : n — 1,0  : n — 1),  denote  by  Mj  the  j-column  of  M,  so  that  Mj(i)  = M(i,j).  Now  we  are 
in  a position  to  cast  equation  1 (and  equation  3 later  on)  of  DPCM  into  a new  form. 

Recall  that  equation  1 is 

X(i,j)  = aX (i,  j - 1 ) + bX(i-  1 ,j)  + cX(i  - l,j  - 1 ) + E(i,j), 

which,  when  using  Mj(i)  for  M(i,j ),  becomes 

Xj{i)  = aXj-i{i)  + bXj{i  — 1)  4-  cXj~i(i  — 1)  + Ej(i). 

Using  the  definition  of  <S,  we  have  Xj(i  — 1)  — («SXj)(i)  and  Xj-i(i  — 1)  = (<SX,_i)(z).  Conse- 
quently, the  above  equation  becomes 

Xj(i)  = aXj_i(i)  + b{SXj)(i)  + c(<SXj_i)(i)  + Ej(i),  for  all  i. 

Noting  that  the  last  equation  involves  only  the  i-th  components  of  the  various  column  vectors, 
the  index  i can  be  dropped,  and  the  equation  becomes  a condensed  vector  equation: 

Xj  — aXj—i  + bSXj  + cSXj—i  + Ej. 

By  moving  bSXj  to  the  left,  and  “factoring  out”  Xj  on  the  left  and  Xj- 1 on  the  right,  we 
obtain 

(1  - bS)Xj  = ( al  + cS)Xj-i  + Ej.  (9) 

The  question  now  is  how  to  get  rid  of  the  “coefficient”  operator  of  Xj  on  the  left  hand  side. 
Essentially,  the  question  is  whether  there  is  an  inverse  operator  (1  — 6<S)_1;  if  such  an  inverse 
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exists,  we  can  multiply  both  sides  of  the  equation  by  that  inverse  to  get  rid  of  the  coefficient 
operator  of  Xj.  Fortunately,  we  know  how  to  compute  the  infinite  series  expansion  of  for 
real  values  x:  y~  = 1 + 2 + a:2  + :r3  + ....  This  expansion  and  the  fact  that  <Sr  = 0 for  r > n 
suggest  the  following  theorem  for  inverting  (1  — bS). 

Theorem  1 (1  - bS)~l  = 1 + bS  + b2S 2 + ...  + 6ra_1<Sn-1. 

Proof,  (1  - bS ) E'To1  VS*  = Er=o  brSr  - E?To  &r+1<^+1  = 1 - bnSn  = 1,  because  <S"  = 0.  ■ 
Multiplying  equation  (9)  by  (1  — bS )_1  on  both  sides,  and  using  Theorem  1,  we  have 

X]  = (E  brSr)(aI  + cS)Xj + (E  brST)E,.  (10) 

r=0  t— 0 

Denote  by  A the  operator  that  is  the  coefficient  of  Xj- 1 in  the  last  equation.  Since  operator 
polynomials  multiply  in  the  same  manner  as  the  more  familiar  real  polynomials,  it  can  be  easily 
seen  that 

A =(£  br  Sr)(aL  + cS)=a+  J2(abr  + cbr~l)Sr.  (11) 

r=0  r— 1 

Denote  by  B the  coefficient  of  Ej  in  equation  (10),  and  let  Fj  = BE3  ,that  is, 

71  — 1 

8=52  VS*,  and  Fj  = FEj.  (12) 

r=0 

Now  equation  (10)  can  be  written  as  a 1-indexed  vector  recurrence  relation  of  order  1: 

A j — AXj—\  + Fj  (13) 

where  the  X,-’ s are  the  unknowns  (to  be  computed),  A is  a known  operator  with  known  param- 
eters (which  depend  on  the  given  values  a,  b and  c,  as  shown  in  equation  (11),  and  the  F/s  are 
known  vectors  because  they  are  computable  using  equation  (12). 

Although  equation  (13)  has  the  simple  form  of  a linear  recurrence  relation  of  order  1,  the 
“multiplication”  of  a vector  by  an  operator  is  not  usually  encountered  in  the  literature  on  parallel 
algorithms  for  recurrence  relations.  Neither  is  it  clear  how  to  parallelize  that  multiplication 
operation.  However,  the  next  theorem  below  will  show  that  multiplying  a vector  by  a polynomial 
operator  is  nothing  but  a convolution,  a partial  convolution  to  be  precise. 

Recall  that  the  convolution  of  two  vectors  JJ{ 0 : n — 1)  and  1/(0  : n — 1)  is  a vector 
W(0  : 2 n — 1)  where  W(i)  = Er=o  U(r)V(i  — r ),  using  again  the  convention  that  when  indices 
are  out  of  their  bounds,  the  corresponding  terms  are  zeros.  Only  the  first  n components  of 
W will  be  of  use  to  us.  Therefore,  we  will  denote  by  <g>  the  partial  convolution  operation  that 
returns  the  first  n components  of  the  answer  of  a full  convolution.  For  ease  of  reference,  we  will 
define  partial  convolution  formally. 

Definition  2 Let  £7(0  : n — 1)  and  V(0  : n — 1)  be  two  vectors.  The  partial  convolution  of 
U and  V,  denoted  U®V , is  a vector  W(0  : n — 1)  of  the  same  length  as  U and  V such  that 
W(i)  = EEo  U(r)V(i  - r),  for  i = 0, 1,  ...,n  - 1. 
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Since  full  convolution  is  an  associative  operation,  it  can  be  easily  shown  that  partial  convo- 
lution 0 is  associative  too. 

Theorem  2 Let  T = X)r=o  arSr  be  a polynomial  operator  of  characteristic  vector  T , where 
T(i ) = Gj,  and  let  V (0  : n — 1)  be  a vector.  Then  TV  = T0V. 

Proof:  Let  W = TV  = 52r=o  QrdTV,  and  recall  that  (<SrV)(z)  = V(i  — r).  Therefore,  W(i)  = 
I3r=o  ar(<5rV)(z)  = Sr=o  arV (*  — r)  = ^r=o  T(r)V(i  — r),  for  all  i = 0, 1, ...,  n — 1.  It  follows 
that  W = T0V,  and  thus  TV  = T0V.  ■ 

Denote  by  A and  B the  characteristic  vectors  of  operators  A and  B defined  in  equations  (11) 
and  (12),  respectively.  Then,  by  the  previous  theorem,  equation  (13)  and  the  accompanying 
definitions  can  now  be  put  in  a condensed  form: 

Xj  = A<g >X,_!  + Fj:  j = 0, 1, ...,  n - 1 (14. a) 

Fj  = B®Ej  (14. b) 

A(0)  = a,  A(r)  = abT  + cbT~l  for  r = 1,2, ...,  n — 1 (14.c) 

B(r)  = bT  for  r = 0, 1, ...,  n — 1.  (14. d) 

Note  that  the  initial  value  of  equation  (14.a)  is  X_i  being  the  zero  column  vector. 

In  the  next  section  a parallel  algorithm  for  computing  first  the  parameters  A and  F/s,  and 
then  the  vectors  Xj: s,  using  equations  14,  will  be  presented  and  analyzed. 


6 Parallel  Algorithm  for  2-Indexed  Relations  of  Order  1 

To  arrive  at  a suitable  architecture  for  solving  equations  (14),  two  observations  must  be  noted. 
First,  first-order  1-indexed  scalar  recurrence  relations  are  best  solved  on  a hypercube  or  a 
fully  partitionable  bus  [19,  20].  Second,  since  each  term  in  equation  14. a is  a vector,  and  the 
operations  involved  are  vector  operations,  namely,  0 and  vector  addition,  every  processor  in 
the  scalar  case  is  best  replaced  by  a column  of  processors,  preferably  connected  as  a hypercube 
or  a fully  partitionable  bus,  to  run  0 fast.  Therefore,  a good  topology  for  the  overall  system  is 
an  n x n mesh  where  every  row  and  every  column  is  either  a hypercube  or  a fully  partitionable 
bus.  (Note  that  an  n2- node  hypercube  can  be  configured  as  an  n x n mesh  of  subcubes.)  Each 
processor  (i,j)  hosts  Xj(i)  = X(i,j):  E(i,j),  Fj(i),  A(j)  and  B[j).  The  algorithm  will  first 
compute  vectors  A and  B in  the  leftmost  column  of  the  mesh,  and  then  broadcast  them  row- 
wise to  the  remaining  mesh  columns;  afterwards,  each  mesh  column  j computes  Fj  = B<g>Ej] 
finally,  equation  14.a  is  solved  in  parallel  by  all  the  n columns  of  the  mesh,  where  every  column 
is  treated  as  a subsystem  in  which  0 and  vector  addition  are  computed  in  parallel. 

Note  that  the  system  may  be  a p x p mesh  of  the  same  topology  as  before,  for  p < n.  In  that 
case,  each  of  the  matrices  X , E and  F is  divided  into  p x p square  blocks  of  size  ^ x | each, 
such  that  blocks  (z,  j)  reside  in  processor  (z,  j).  Also,  A and  B are  divided  into  p subcolumns 
of  size  n/p  each;  they  are  computed  in  the  leftmost  column  of  the  mesh,  and  then  the  z-th 
subcolumn  of  A and  of  B are  broadcast  row-wise  to  all  the  processors  in  row  z of  the  mesh,  for 
all  z.  The  computations  are  parallel  across  blocks,  but  sequential  within  the  blocks,  because 
each  block  is  allocated  to  a single  processor. 
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The  algorithm  is  given  next. 


Algorithm  Index-2-order-l  (input:  E,  a,  6,  c;  output:  A) 
begin 

/*  Setting  up  of  equation  14. a:  computing  B , A and  the  F/ s */ 

1.  The  1st  column  of  the  mesh  performs  parallel  prefix  to  compute  B( 0 : n — 1)  where 
B{r)  = bT\ 

2.  The  1st  column  of  the  mesh  first  computes  vector  D = SB  by  shifting  down  vector  B one 
step,  and  then  computes  vector  A in  parallel  asA  = a*B  + c*D] 

3.  The  1st  column  of  the  mesh  broadcasts  vectors  A and  B row-wise,  that  is,  for  each  row  z, 
the  1st  processor  hosting  A(i)  and  B{i)  broadcasts  them  to  all  the  processors  in  its  row; 

4.  for  j=0  to  n — 1 pardo 

5.  Compute  Fj  = B®Ej  by  a parallel  convolution  algorithm  on  a separate  column; 

endfor 

/*  Solving  recurrence  relation  (14. a)  */ 

6.  Solve  the  1-indexed  vector  recurrence  relation  X3  — A0A?_1  + Fj] 

Note  that  since  each  term  in  the  recurrence  relation  is  a column  vector,  each  vector 
is  handled  by  a separate  mesh  column,  and  each  vector  operation  involved  (0  and  +) 
is  performed  in  parallel  within  the  host  column  of  the  mesh  architecture. 

end 


Analysis 

First  assume  an  n x n mesh  of  hypercubes  or  partitionable  buses.  Step  1 takes  O(logn) 
time.  Step  2 takes  0(1)  time  if  using  partitionable  buses;  however,  if  the  leftmost  column  is  a 
hypercube,  the  shifting  down  of  B by  one  step  is  a mapping  i — > i + 1,  which  requires  O(logn) 
time  to  execute,  and  the  rest  of  the  computation  of  A takes  0(1)  time.  Step  3 takes  0(1)  time 
if  the  rows  are  buses,  and  O(logn)  if  the  rows  are  hypercubes.  Step  5 (and  thus  Step  4)  takes 
O(logn)  times  on  either  topology  using  FFT  [19,  20].  Finally,  step  6 takes  0((logn)(T®  + T+)), 
where  T®  is  the  time  to  execute  one  0 operation  on  one  mesh  column,  and  T+  is  the  time 
to  add  two  vectors  in  one  mesh  column.  Clearly,  T®  = O(logn)  as  indicated  earlier,  and 
T+  = 0(1).  Therefore,  step  6 takes  a dominant  0(log2n)  time,  leading  to  an  overall  algorithm 
time  complexity  of  0( log2  n).  The  corresponding  speedup  is  0(^p-^),  and  the  efficiency  of  the 
algorithm  is  O(^). 

If  the  mesh  is  p x p where  p < n,  the  same  analysis  applies  with  a few  adjustments. 
Essentially,  one  convolution  operation  0 takes  0(n/p  + logp)  on  p processors.  Therefore, 
step  1 takes  0(n/p  + logp)  time.  One  can  bypass  step  3 by  broadcasting  (a,  6,  c)  to  all  the 
processors,  and  letting  each  column  perform  steps  1 and  2 redundantly  and  independently  in 
0(n/p  + logp)  time.  Steps  4 and  5 involve  n/p  convolutions  per  column  of  processors,  thus 
taking  0{{n/p)(n/p  + logp))  time.  Finally,  step  6 takes  0((n/p  + logp)(T®  -f  T+)),  where 
now  T®  = 0(n/p  + logp)  and  T+  = 0(n/p)  because  of  using  only  p processors  per  each 
operation.  Therefore,  step  6,  and  consequently  the  whole  algorithm  take  0({n/p  + logp)2) 
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time,  If  plogp  < n,  then  logp  < n/p  and  the  whole  algorithm  takes  0{{n/p)2)  time,  resulting 
in  speedup  of  0(n2/(n/p )2)  = 0(p2)  and  efficiency  0(1). 


7 Generalization  to  2-Indexed  Relations  of  Arbitrary  Or- 
der 

as  mentioned  earlier,  predicting  values  in  DPCM  can  involve  more  than  one  previous  term  in 
each  matrix  dimension.  That  leads  to  higher  order  DPCM,  or  specifically  DPCM  of  order  (k,  t), 
for  which  the  DPCM  decoding  relation  is  equation  (3)  which  we  re-state  here  as  equation  (15): 

X{i,j)  =J2J2a(lis)X(i  ~ ~ s)+E(i,j ),  (15) 

1=0  s= 0 

where  a(0,  0)  is  equal  to  0 so  that  X(i,j)  does  not  depend  on  itself.  The  array  a(0  : k,  0 : t)  of 
DPCM  parameters  is  given,  and  k and  t are  typically  small  positive  integers. 

Equation  (15)  will  be  converted  to  a 1-indexed  f-th  order  vector  recurrence  relation,  using 
the  same  shift  operator  S.  Using  the  notation  Afj(i ) for  M(z,  j),  equation  (15)  becomes 

X}(i)  = E E a(l,  s)Xj-s(i  -l)  + Ej(i) 

1=0  s= 0 

Since  Xj-S(i  — l)  = (SlXj_s)(i),  the  above  equation  transforms  to 

XA0  = 't't(a(l,s)S‘X3-,)(i)+Ej(i),  for  all  i. 

1=0  s=0 

Because  the  last  equation  involves  the  i-th  components  of  the  various  column  vectors,  the  index 
i can  be  dropped,  and  the  equation  becomes  a condensed  vector  equation: 

s=0  1=0 

where  the  order  of  the  double  summation  is  reversed.  By  splitting  the  summation  over  s into 
two  parts,  one  for  s = 0 and  the  other  for  s ^ 0,  and  by  recalling  that  a(0, 0)  = 0,  we  obtain 

X,  = (E  a(l,  0)S‘Xj)  + E E a(l,  s)S‘Xj.t  + Er 

1=1  s=  1 i=0 

Next,  move  the  first  sum  on  the  right  hand  side  to  the  left,  and  factor  Xj  out,  resulting  in 

(l-Za(l,0)#)XJ  = 't{ta(l,s)Sf)XJ-.  + Ej.  (16) 

1=1  s=l  1=0 
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To  simplify  the  notation,  let 


75  = 1 -£«(«,  OJS1,  Ts  = 'ta(l,s)Sl 
1=1  1= 0 

so  that  equation  (16)  becomes  To Xj  = Es=i  TsXj-s  + Ej,  leading  to 

Xj  = + T0~'Ej.  (17) 

S=1 

As  we  did  with  equation  (5)  in  Section  4,  we  have  to  invert  the  polynomial  operator  %. 
The  inversion  here  is,  however,  more  complex  than  it  was  in  the  case  of  equation  (5).  As  the 
next  theorem  will  show,  T0~l  does  exist,  and  it  is  a polynomial  operator  whose  coefficients  will 
be  the  solution  of  a scalar  recurrence  relation  of  order  k. 

Theorem  3 1)  7q_1  = YlTXo  C(l)Sl , where  the  C(l)’s  satisfy  the  following  1-indexed  scalar 
recurrence  relation  of  order  k: 

k 

C(0)  = 1,  C(r)  = '^2a(l,0)C(r  — l),  for  1 < r < n — 1.  (18) 

z=i 

2)  For  every  s = 1, 2, . . . , t,  %~lTs  = Ef=o  Bs(l)Sl,  where 


Bs  = C®a(0  : n — 1,  s ). 


(19) 


Note  that  for  m > k,  a(m,  s ) = 0. 

Proof:  1)  The  C(l)'s  satisfy  (1  — T,’i=ia{l,0)Sl)(Yfi'TQ  C(l)Sl)  = 1.  In  that  identity  relation, 
the  coefficient  of  <S°  is  C(0)  on  the  left  side  and  1 on  the  right  side,  leading  to  C(0)  = 1.  For 
r = 1, 2,  ...n  — 1,  the  coefficient  of  S7-  is  C(r)  — a(^0)C(r  — l ) on  the  left  hand  side, 

and  0 on  the  right  hand  side.  Thus,  C(r)  = a(Z,  0)C(r  — /).  By  taking  C(m)  = 0 for 

m < 0,  the  value  of  C(r)  simplifies  to  C(r)  = J2i=i  a{U  0 )C(r  — l). 

2)  Thr  proof  of  this  part  follows  immediately  from  observing  that  operator  polynomials  multiply 
in  the  same  way  as  real  polynomials,  that  polynomial  multiplication  is  equivalent  to  convolution 
of  the  coefficients,  and  that  Sr  — 0 for  all  r > n.  ■ 

Using  Theorem  3,  and  recalling  from  Theorem  2 that  applying  a polynomial  operator  to  a 
vector  is  a convolution  ®,  equation  17  becomes 

Xj  = EUi  Bs®Xj-s  -f  Fj  (20. a) 

Fj  = C®Ej.  (20. b) 

Clearly,  equation  20. a is  a 1-indexed  vector  recurrence  relation  of  order  t.  This  completes 
the  dimension  shifting  from  index  dimension  to  term  dimension.  It  remains  to  shift  the  order 
dimension  to  term  dimension. 

The  standard  method  for  solving  1-indexed  recurrence  relations  of  order  t > 1 is  to  convert 
the  equation  into  a vector  equation  of  order  1 such  that  every  vector  has  t terms,  where  each 
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term  has  the  same  dimensionality  as  the  terms  of  the  original  recurrence  relation.  Specifically, 
equation  20. a will  be  converted  to  the  following  form 

X j = AQX  + F j (21) 

where  Xj  is  a column  vector  of  t vectors 

Xj  = [Xtj~ i Xtj_2  Xtj_2  ■■■Xtj-t]T,  j = 1,2,  ...,n/t, 

and  A , F j and  operation  © are  to  be  defined.  Note  that  in  the  definition  of  Xj , the  exponent 
T stands  for  matrix  transpose.  Note  also  that  Xq  is  the  zero  vector  of  length  t x n. 

The  conversion  from  order  t to  order  1 is  usually  straightforward,  but  because  each  term 
in  equation  20.a  is  a long  vector  and  the  operations  involved,  especially  ©,  are  fairly  costly, 
the  conversion  becomes  a considerable  and  time-consuming  task,  and  thus  warrants  special 
attention  to  parallelize  it.  The  conversion  entails  computing  the  matrix  A and  the  vectors  F/s, 
and  defining  the  operation  ©. 

The  values  of  A and  the  F j' s are  derived  by  repeated  application  of  equation  (20. a)  so  that 
each  Xtj-r , for  r = 0, 1,  ...,t  — 1,  is  expressed  in  terms  of  Xt(j~ i)_l3  X£(j_i)_2,...,.Xf(j_i)_t  and 
Ftj-r,  Ftj-r-i,...,Ftj-t.  We  summarize  next  the  outcome  of  that  derivation/conversion  process, 
leaving  out  the  derivation  details  for  the  sake  of  brevity. 


• The  matrix  A in  equation  (21)  is  a t x t matrix  where  each  term  is  a vector  of  length 
n,  for  2=1,  ...,i  and  s = 1,2,  ...,t.  Specifically,  A^s  satisfies  the  following  equation 


At,s  — Fs,  — Bs^Ai^i  T A^s. j_i , i t 1, 2,  s — 1, 2, ...,  t 


(22) 


where  Bs  was  defined  in  equation  (19)  within  Theorem  3. 


• For  every  column  vector  Y = [Ti  Y2  ...  Yt]T  where  each  Yi  is  a vector  of  length  n, 

t 

AQY  = Z , where  Z(i)  = ^2  AiiS<g>Ys.  (23) 

S=1 

Note  that  A(-)Y  is  similar  to  a matrix-vector  product  except  that  the  scalar  multiplication 
operation  is  replaced  by  the  vector  convolution  operation  <g>.  Note  also  that  if  © is 
performed  on  a system  consisting  of  t subsystems,  where  subsystem  i computes  Z(i), 
then  the  time  TQ  = txT®  + (t  - 1)  x T+.  In  particular,  if  each  subsystem  has  n 
processors,  then  T®  = O(logn)  time  by  performing  parallel  convolution  through  FFT, 
and  T+  = 0(1)  because  it  is  a parallel  vector  addition.  Hence,  T©  = O(tT0)  = O(tlogn). 

• Each  column  vectors  Fj  in  equation  21  is  of  the  form 

F,  = [FW  Fj3  . . . Fj,t]r,  (24) 

where  the  terms  FjjT  are  themselves  vectors  of  length  n,  satisfying  the  following  equation 

Fjj  — Ftj—t , and 

t 

for  r = 1,2, ...,  t - 1,  Fj?r  = Ftj-r  + J]  A£+r+i_S)i©F£j_s,  j = 1, 2, ...,  n/t.  (25) 

s=r+l 

Note  that  the  FtJ_r’ s are  defined  according  to  equation  20. b. 
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The  general  DPCM  decoding  algorithm  must  solve  equation  (21).  For  that  purpose,  it  must 
first  compute  equations  (18),  (19),  (20.b),  (22)  and  (25).  To  understand  the  algorithm,  it  helps 
to  view  the  n x n mesh  architecture  in  two  ways,  one  way  as  a sequence  of  n columns  labeled 
globally  0,1,..., n — 1,  and  another  way  as  partitioned  into  n/t  contiguous  subsystems  of  t 
contiguous  columns  each  (n/t  is  assumed  to  be  an  integer  for  simplicity).  The  subsystems  are 
labeled  1,  ...,n/£,  and  the  columns  within  each  subsystem  are  labeled  locally  1,2,  ...,£. 

Data  Assignment 

Assume  that  each  subsystem  has  the  array  a such  that  each  coi  imn  s of  array  a is  stored  in 
the  s-th  column  of  each  subsystem,  one  term  per  processor.  Recall  again  that  for  m > k, 
a(m,  s)  = 0.  In  addition  to  column  a(.,  1),  column  0 of  the  mesh  has  a(.,0).  The  columns  Ej 
and  Fj  are  stored  in  column  j of  the  mesh,  that  is,  E(i,j ) and  F(i,j ) are  in  processor  (i,  j). 
Each  local  column  s of  each  subsystem  j hosts  vectors  Bs,  A^s  for  alH  = 1, 2, ...,  £,  and  FjiS. 

Algorithm  Index-2-order-k-t  (input:  E,a( 0 : /c,  0 : £);  output:  A(0  : n — 1, 0 : n — 1)) 

begin 

1.  Column  0 of  the  mesh  computes  C(0  : n — 1)  by  solving  the  1-indexed  scalar  recur- 
rence relation  (18)  of  order  k,  using  any  standard  parallel  recurrence  relation  solver. 
Afterwards,  C is  broadcast  row- wise  to  all  the  columns  in  the  mesh. 

2.  for  j— 0 to  n — 1 pardo 

3.  Grid-column  j does:  Fj  = C®Ej  by  a parallel  convolution  algorithm; 
endfor 

4.  for  s = 1 to  t pardo 

5.  Local  column  s of  each  subsystem  does:  Bs  — C<S>a( 0 : n — 1,  s);  AtiS  = Bs\ 

endfor 

6.  for  i = t down  to  2 do 

7.  Local  column  1 of  each  subsystem  broadcasts  Ai j rowwise  to  the  columns  of  its 
subsystem; 

8.  Each  column  s = 2,3 ,...,£  in  each  subsystem  sends  rowwise  the  column  vector 
Ai,s  to  the  previous  mesh  column; 

/*  Steps  7 and  8 deliver  data  to  processors  to  compute  equation  (22)  next.  */ 

9.  for  s = 1 to  t pardo 

10.  Local  column  s of  each  subsystem  does:  i,s  = Bs<g>A^i  + Ai)S+i; 

endfor 

endfor 

11.  for  j= 1 to  n/t  pardo 

12.  for  r= 1 to  t pardo 

13.  Local  column  r of  subsystem  j does: 

it  gets  Ftj-S  from  local  column  s of  subsystem  j,  for  s = r,r  + 1,  ...,£; 
it  then  computes  Fj >r  = Ftj_T  + ^=r+1  At+r+1_Sti®Ftj-s; 
endfor 
endfor 

14.  All  the  subsystems  of  the  mesh  work  together  to  solve  the  1-indexed  recurrence  relation 
Xj  = AQXj-i  + Fj  in  parallel,  where  Xj  resides  in  subsystem  j; 

end 
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Analysis 


Step  1,  being  a 1-index  scalar  recurrence  relation  of  order  k , takes  0{k  log|)  time.  The 
broadcast  of  C afterwards  takes  0(1)  time  on  partitionable  buses,  or  O(logn)  on  a mesh  of 
hypercubes.  Step  3 (and  thus  step  2)  takes  O(logn)  time,  being  a convolution.  Similarly,  steps 
4 and  5 take  O(logn)  time  for  the  same  reason.  The  communication  steps  7 and  8 take  0(1) 
time  on  a mesh  of  partitionable  buses,  or  O(logn)  on  a mesh  of  hypercubes.  Steps  9 and  10  take 
O(logn)  time,  since  they  involve  a convolution  and  a vector  addition.  Step  13  requires  each  local 
column  s in  each  subsystem  j to  bicadcast  its  vector  Ftj_s  to  several  columns  r = 1, 2, ...,  s — 1 
within  the  same  subsystem.  Each  such  broadcast  takes  0(1)  time  on  partitionable  buses,  or 
O(logn)  time  on  a mesh  of  hvpercubes.  Since  step  13  requires  £ such  broadcasts,  which  have 
to  be  performed  one  after  another,  the  communication  time  of  step  13  is  0[t)  on  partitionable 
buses,  or  O(tlogn)  on  a mesh  of  hypercubes.  After  the  broadcasts,  the  computation  of  Fj >r 
in  step  13  takes  O(tlogn).  Step  14,  being  a 1-indexed  recurrence  relation  of  order  1,  takes 
0((T©  -f  T+)  log  j).  The  time  T©  = 0(£T©)  = 0(£logn)  as  was  explained  right  after  equation 
(23).  Therefore,  the  total  time  complexity  of  the  algorithm  is  0(k  log  | + £lognlogy),  which 
is  dominated  by  0(t  log2  n)  for  small  values  of  £ and  k.  Since  the  sequential  time  of  computing 
the  original  recurrence  relation  (15)  is  0(ktn 2),  it  follows  that  the  speedup  is  0{kn2/ log2  n) 
and  the  efficiency  is  0(k/  log2  n). 

8 Generalization  to  Multi-indexed  Relations 

In  this  section  we  treat  the  most  general  case,  namely,  g-indexed  relations  of  arbitrary  order, 
where  q > 2.  The  treatment  will  require  a generalization  of:  (1)  the  ID  shift  operator  to 
( q — l)-dimensional  shift  operators,  (2)  the  subsequent  generalization  of  the  “single- variable” 
polynomial  operator  to  something  that  resembles  a multivariable  polynomial  operator,  and  (3) 
finally  the  use  of  ( q — l)-dimensional  convolution  rather  than  ID  convolution.  The  conversion 
process  from  k- indexed  arbitrary-order  relations  to  1-indexed  relations  of  order  1 is  otherwise 
the  same  as  in  the  previous  sections. 

Recall  from  equation  (4)  that  a g-indexed  recurrence  relation  of  order  (£i,£2,  -~,£g)  has  the 
following  form: 


t\  t-2  tq 

- Y.  a(F,r2,...,rq)X{i1-r1,i2-r2,...,iq-rq)  + E(il,i2,...:iq)  (26) 

n=0r2=0  r<j=0 

where  = 0,1,...,  Ni  — 1 for  l = 1,2, ...,  g,  the  array  E{ 0 : N\  — 1,0  : N2  — 1, ...,  0 : Nq  — 1) 
and  the  array  a(0  : £x,  0 : £2,  — , 0 : tq ) of  parameters  are  given,  a(0, 0, ...,  0)  is  equal  to  0 so  that 
X(ii,  i2, ...,  iq)  does  not  depend  on  itself,  and  ti  « Ni  for  l = 1, 2, ...,  q.  As  before,  any  term  is 
assumed  to  be  zero  if  its  indices  are  out  of  their  defined  bounds. 

To  simplify  the  multi-index  cumbersome  notation,  we  will  introduce  some  notations  and 
conventions.  Let 

• 7 1 = {(rl5  r2, ...,  rq ) | rt  = 0, 1,  ...£/  for  all  l = 1,2, ...,  g},  and  any  r in  71  is  by  convention 
a vector  r = (rl5r2,  ~-,rq). 


16 


• 'll'  = {(n,  r2, rq_i)  I ri  = 0, 1,  ...t;  for  all  l = 1, 2, q - 1}. 

• 71*  = 71  — {(0,  0. 0)},  that  is,  the  set  71  except  of  the  zerotuple. 

• 71'*  = 71'  - {(0,0,...,0)}. 

• X = { (zi,  z2, iq)  | ii  = 0, 1,  ...TV;  — 1 for  all  l = 1,2, ...,  q),  and  any  / in  X is  by  convention 
a vector  i = (ij,  i2, ...,  iq). 

• X'  = {(?!,  z2, ...,  iq-i)  | ii  = 0, 1,  ...TV;  - 1 for  all  l = 1,2, ...,  q - 1}. 

• Convention:  If  r = (ri,  r2, ...,  rq)  is  an  index  vector  in  71 , then  r'  is  vector  derived  from 
r by  dropping  the  last  term  of  r.  Clearly,  r'  = (r1?  r2, ...,  rq_i)  G 71',  and  ( R',rq ) = r. 
Similarly,  for  any  index  vector  / in  X,  denote  by  /'  the  vector  derived  from  / by  dropping 
the  last  term  from  the  latter. 

• Convention:  For  any  ^-dimensional  array  M and  any  integer  iq,  denote  by  Miq  the  (q  — 
l)-dimensional  array  such  that  Miq(ii,  i2, ...,  iq~i)  = M[ii,  z2, ...,  iq~\,  iq).  Equivalently, 

Miq(r)  = M(i). 


Accordingly,  equation  (26)  becomes 


*(/)=  E a(«)X(f  - «)  + £(/). 

RaTV 


(27) 


Using  the  last  two  conventions  indicated  above,  the  last  equation  translates  to 

Xi,(i')=  E oWAi.-Jr  -R')  + Ei, (/'). 

R~1Z- 

Split  the  multiple  summation  on  the  right  hand  side  of  the  previous  equation  into  two  parts: 
one  for  rq  = 0 and  another  for  rq  ^ 0.  This  leads  to 


= Y,  a{R',0)Xig{r  -R')+J2  a(R',rq)Xiq.rq{r  - r')  + Elq{r).  (28) 

R'eTl'*  R'eTl' 


At  this  point  we  introduce  the  generalization  of  the  shift  operator.  For  each  vector  r'  = 
(rq,  r2, ...,  r9_i)  of  non-negative  components,  define  the  shift  operator  Sr  to  operate  on  {q  — 1)- 
dimensional  arrays  V of  size  x TV2  x • • • x TV9_X  as  follows:  Sr'V  is  vector  of  the  same 
dimensional  structure  as  V such  that  ( Sr'V)(i ')  = V(r  — r')  for  all  /'  G X'.  Therefore,  Xiq(r  — 
r' ) = (Sr'X1q )(r).  Note  that  if  r;  > TV;  for  some  /,  then  SR’  is  the  zero  operator,  that  is  Sr>V 
is  equal  to  the  zero  array  of  size  TVX  x TV2  x • • • x TV9_X.  Note  also  that  if  R'  and  s'  are  two  index 
vectors  in  71',  then  it  can  be  easily  verified  that  SR'Ss<  = <Sr'+s'-  Indeed,  we  can  correspond 
to  SR > a polynomial  term  xrf  xr22 ...xrqqXi  of  q — 1 variables;  the  product  (i.e.,  the  composition) 
of  two  operators  SR'SS'  is  equivalent  to  the  product  of  their  corresponding  polynomial  terms. 
Hence,  one  can  define  "multivariable”  polynomial  operators. 
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Definition  3 Given  a (q-1) -dimensional  real  array  T(i'),  r G T , the  operator  T — J2rej'  T(i')Sp, 
called  the  multivariable  polynomial  operator  of  characteristic  array  T,  is  such  that  for  any  array 
V of  size  Ni  x N2  x • • • x Nq-1}  TV " = T(i')(SrV). 

Equation  (28)  can  now  be  put  in  the  following  form: 

t<) 

Xi,{v)  = ( E a(*',0)SffXi,)(/')  + E(  E '•l*,Tq)SB:Xi,-T,)(r)  + £„(,').  (29) 

RelZ'*  r<=1  R'elZ' 

Since,  the  vector-index  r is  the  same  in  all  the  terms  of  the  previous  equation,  we  can  drop 
it  resulting  in  a compact  (q  — l)-dimensional  array  equation 

tq 

Xlq  = X ^(^,0)5^^  + X X a{R'irq)^R'^iq-Tq  + Eiq.  (30) 

R'elZ'*  r^1  R'elZ' 


Next,  move  the  first  summation  on  the  right  hand  side  to  the  left,  and  factor  Xlq  out.  This 
results  in 


(1-  X a(R',0)SR/)Xiq 

R'elZ'* 


tq 

X X!  a{R'')  rq)<SR' X-iq-Tq 

r<?=1  R'elZ' 


+ E 


(31) 


We  simplify  the  notation  by  defining 

7o  = 1 — X a(R',0)SR',  Trq  = X a(R,Tq)SR',  for  rq  = l,2,...,t9. 

R'elZ'*  R'elZ' 

As  a result,  equation  (31)  becomes  ToXiq  = TlTi  TqXiq-Tq  4-  Eiq,  leading  to 

Xi,  = E 7T%Vi,-,,  + Ta~'Eiq  (32) 

Tq  = i 


The  next  theorem,  which  combines  and  generalizes  Theorems  2 and  3 to  multivariable 
polynomial  operators,  will  show  that  the  application  of  a multivariable  polynomial  operator 
to  an  an  array  is  equivalent  to  multidimensional  partial  convolution,  that  the  inverse  of  T 
exists  and  can  be  computed  by  solving  recursively  a (q  — l)-index  recurrence  relation  of  order 
(ti,  t2, ...,  tq- 1),  and  that  the  product  of  two  multivariable  polynomial  operators  is  a polynomial 
operator  whose  characteristic  array  is  the  partial  convolution  of  the  characteristic  arrays  of  the 
factor  operators.  To  that  effect,  we  first  define  partial  multidimensional  convolution,  and  then 
prove  the  theorem  afterwards. 


Definition  4 The  partial  convolution  of  two  N\  x A72  • • • x Nq-\  arrays  U and  V is  an  N\  x 
N2  • • • x Nq-i  array  W , denoted  W = U<8>V,  such  that 

W (*!,  *2,  Vi)  = Er1=o  X22=o  - EXUo  U (ri,  r2, rq- i)V (A  - rx,  i2  - r2, ...,  iq- 1 - r,_i). 
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Theorem  4 1)  Let  T and  V be  two  N\  x N2  x — x Nq-i  arrays,  and  7~  i/ie  multivariable 
polynomial  operator  of  characteristic  array  T.  Then,  TV  = T®V . 

2)  7q  1 = where  C is  an  N\  x N2  x • • • x array  that  satisfies  the  following 

( q — l)-mde:r  recurrence  relation  of  order  (ti,t2, 

C(0.0,....0)  = 1,  C{i')=  2 a(R',0)C(/'-R')  (33) 

r'FR' 

3)  For  every  i,  = 1,2 V'VS  = Hrehip  Biq(i')Sy,  where 

Biq=C®aiq,  (34) 

and  a*,  is  an  Ni  x N2  x ■■■  x .Vg_i  array  sac/z  that  aiq(ii,i2,  ...,iq-\)  = a(ii,i2,  ...,iq_1:iq)  if 
(ii,i2,  ...,iq_ i)  ?5  m 72/ , and  equal  to  zero  otherwise. 

Proof:  The  proof  of  (1)  follows  the  same  reasoning  as  in  the  proof  of  Theorem  2.  The  proof 
of  (2)  and  (3)  follows  the  same  reasoning  as  in  the  proof  of  Theorem  3.  ■ 

Now,  by  using  the  previous  theorem,  we  convert  equation  (32)  to  a form  identical  to  equa- 

tions (20)  (correspond  iq  to  j and  rq  to  s): 

Xiq  = 2r,=i  Frq®Xiq_Tq  + Fiq  (35. a) 

Fiq=C®Eiq.  (35.  b) 

Clearly,  equation  (35. a)  is  a 1-indexed  recurrence  relation  of  order  tq,  where  every  term  is 
an  jVj.  x A^2  x - • • x Nq_i  array.  This  completes  the  dimension  shifting  from  index  dimension  to 
term  dimension.  The  shifting  of  the  order  dimension  to  term  dimension  from  here  on  mirrors 
the  same  process  conducted  in  the  previous  section  starting  from  equation  (21)  onward,  and, 
therefore,  will  not  be  repeated.  The  resulting  1-index  recurrence  relation  of  order  1,  similar  to 
equation  (21),  is 

Xiq  = AQXiq-\  + Fiq  (36) 

The  definitions  of  A , Xiq,  Fiq , Fiq,rq,  and  © are  formally  the  same  as  in  the  previous  section, 
with  the  proper  substitutions.  Note  in  particular  that  A is  a tq  x tq  matrix  where  every  term 
is  an  Ni  x N2  x ■ • • x Nq- 1 array,  and  the  convolution  © involved  in  the  definition  of  0 is 
now  a (q  — l)-dimensional  convolution  on  N\  x N2  x • • • x 7Vg_!  arrays. 

The  architecture  is  naturally  an  N\  x N2  x ■ • • x Nq  mesh  of  hypercubes  or  of  partitionable 
buses.  As  in  the  previous  section,  the  mesh  can  be  viewed  as  Nq  submeshes  (as  opposed  to 
columns),  each  being  an  ATi  x A^x  • • • x Nq-i  mesh.  The  submeshes  are  labeled  0, 1, Nq— 1.  On 
the  other  hand,  the  whole  mesh  can  be  viewed  as  partitioned  into  Nq/tq  contiguous  subsystems 
of  tq  contiguous  submeshes  each.  The  subsystems  are  labeled  1, 2, ...,  Nq/tq , and  the  submeshes 
within  each  subsystem  are  labeled  locally  1,2,  ...,tq. 

Data  Assignment 

Assume  that  each  subsystem  has  the  parameter  array  a such  that  the  (sub)  array  arq  is  stored  in 
the  r9-th  submesh  of  each  subsystem,  one  term  per  processor.  In  addition  to  array  a\,  submesh 
0 of  the  mesh  has  the  array  a0.  The  arrays  Eiq  and  Fiq  are  stored  in  submesh  iq , that  is, 
E{i\,  i2, iq)  and  F(ii,  i2, ...,  iq)  are  in  processor  (fl5  i2, ...,  iq).  Each  local  submesh  rq  of  each 
subsystem  iq  hosts  arrays  BTq , Ai^q  for  all  i = 1,2 and  FiqjTq. 
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The  algorithm  for  solving  equation  (36)  is  largely  identical  to  the  Index-2-order-k-t  algorithm 
for  equation  (21)  presented  in  the  previous  section.  To  map  the  latter  algorithm  to  an  algorithm 
for  equation  (36),  only  three  major  changes  are  needed: 

• Step  1 of  the  algorithm  must  be  a recursive  call  to  compute  the  array  C of  the  (q  — 1)- 

indexed  recurrence  relation  (33)  of  order  (£1?  t2,  i9_i). 

• Add  a basis  step  corresponding  to  the  case  where  q = 2,  in  order  for  recursion  to  work. 

• Substitute  “submesh”  for  “column”  and  “Grid-column”,  “(g-l)-dimensional convolution” 
for  “convolution”,  “array”  for  “vector  column”,  tq  for  £,  Nq  for  n,  iq  for  j,  rq  for  s. 


Algorithm  General(input:  £(0  : Ni , ...,  0 : Nq),  a(0  : ti, ...,  0 : £g);  out:  X(0  : iVi, ...,  0 : Nq)) 

begin 

0.  Basis  step:  if  q = 2,  call  algorithm  Index-2-order-k-t  and  return; 

1.  Submesh  0 calls  the  algorithm  “General”  recursively  to  compute  C(0  : iVi, ...,  0 : Nq~ i) 
by  solving  the  ( q — l)-indexed  scalar  recurrence  relation  (33)  of  order  (f^  £2,  ••-,  £<?- 1)- 
Afterwards,  C is  broadcast  along  dimension  q to  all  the  submeshes. 

2.  for  iq= 0 to  Nq  — 1 pardo 

3.  Submesh  iq  does:  Flq  = C®Eiq  by  a parallel  convolution  algorithm; 

endfor 

4.  for  rq  = 1 to  tq  pardo 

5.  Local  submesh  rq  of  each  subsystem  does:  Br<]  = C®aTq ; Atq}Tq  = BTq, 

endfor 

6.  for  i = tq  down  to  2 do 

7.  Local  submesh  1 of  each  subsystem  broadcasts  A^i  along  dimension  q to  all  the 
submeshes  of  its  subsystem; 

8.  Each  submesh  rq  = 2. 3, ...,  tq  in  each  subsystem  sends  the  vector  array  A^rq  one 
step  down  along  dimension  q in  the  mesh; 

/*  Steps  7 and  8 deliver  data  to  processors  to  compute  A next.  */ 

9.  for  rq  = 1 to  tq  pardo 

10.  Local  submesh  rq  of  each  subsystem  does:  iir?  = Brq®Ai^  + A^rq+i\ 

endfor 

endfor 

11.  for  iq= 1 to  Nq/tq  pardo 

12.  for  r=l  to  tq  pardo 

13.  Local  submesh  r of  subsystem  iq  does: 

it  gets  Ftqiq—rq  from  local  submesh  rq  of  subsystem  iq,  for  rq  = r,r  + 1, 
it  then  computes  FiqJ  = Ftqiq-r  + 
endfor 
endfor 

14.  All  the  subsystems  of  the  mesh  work  together  to  solve  the  1-indexed  recurrence  relation 
Xiq  = A(DXiq- 1 + Fiq  in  parallel,  where  Xiq  resides  in  subsystem  iq ; 

end 
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Analysis 


Let  T(Ni,ti,  A2,  t2i Nq , tq)  be  the  parallel  time  complexity  of  the  whole  algorithm,  and  let 
A = NxN2...Nq.  Step  1,  being  a recursive  step,  takes  T(A1,t1,  A2,f2, ... ,Nq-i,tq-i)  time.  The 
time  of  all  the  remaining  steps  is  dominated  by  the  last  step,  step  14,  which  is  the  solution  of  the 
1-indexed  recurrence  relation  of  order  1,  equation  (36),  and  thus  takes  0((TG+T+)  log  time. 
Note  that  © is  a matrix-vector  product  where  the  matrix  is  tq  x tq  and  the  multiplication  oper- 
ation is  replaced  by  (q  - l)-dimensional  <8>.  Multidimensional  convolution  can  be  implemented 

by  multidimensional  FFT,  performed  one  dimension  after  another,  in  0(log  A],  + log N2  H 1- 

log Ng-i)  - 0(log(Ari7V2...A9_1))  time  because  FFT  in  dimension  l takes  O (log Ni)  time.  Since 
Te  = 0(tqT®)  (as  was  explained  after  equation  23),  it  follows  that  T0  = 0(tq  log (N/Nq)).  Since 
also  T+  = 0(1),  it  follows  that  step  14  takes  0(tq  log(Ni A2...A9_i)  \og(Nq/tq)).  Consequently, 
we  have 


T{NU  tl5 ...,  Nq,  tq)  = T(Ni,  tu  ...,  Nq_u  VO  + 0(tq  log^.-A^)  \og(Nq/tq)). 

This  recurrence  relation  easily  yields  that 

T(NU  tu  A2,  t2, ...,  Nqi  tq)  = T(Ni,  tx)  + 0(Ef=  2 ti  lo&(Ni—Ni-i)  log(JVi/t4)) 

Clearly,  T(NX,  tx)  = 0(tx  log  ^)  because  it  corresponds  to  solving  a scalar  1-indexed  recurrence 

relation  of  order  tx.  Since  \og(Ni...Ni-i)\og(Ni/ti)  < log  N log  (Ni /tt)  < \ogN\ogNi  < log2  A, 
the  time  formula  for  the  whole  algorithm  simplifies  to 

T(Nuti,N2,t2,...,Nq,tq)  = 0(QTfz)  log2  N). 

i=i 

Since  the  sequential  time  of  solving  equation  (26)  is  0(tit2...tqN),  it  follows  that  the  speedup  is 
logMv)’  an<^  efficiency  is  0(^^^j^p-^),  where  N is  the  size  of  the  output  array 

X.  In  the  typical  cases  where  the  ti  s are  small  constants,  the  overall  time  becomes  0( log2  N ), 
and  the  speedup  and  efficiency  become  0(N/ log2  N)  and  0(1/ log2  A),  respectively. 


9 Conclusions 

In  this  paper  we  developed  a dimension-shifting  approach  for  novel  parallel  solution  of  multi- 
indexed  recurrence  relations  of  arbitrary  order  in  square-logarithmic  time,  using  a mesh  of 
hypercubes  or  of  partitionable  buses.  Multi-indexed  recurrence  relations  have  not  been  ad- 
dressed before,  but  hold  special  significance  due  to  their  application  to  DPCM,  which  is  a 
standard  image  compression  technique. 

We  introduced  two  other  techniques,  index  sequencing  and  index  decoupling.  The  first 
is  very  simple  and  works  well  for  a small  number  of  processors,  but  does  not  exploit  all  the 
potential  for  parallelism.  The  second  is  limited  in  its  applicability,  but  when  applicable  it  is 
faster  than  the  other  two  techniques.  Indeed,  it  takes  logarithmic  time,  which  is  optimal. 

For  future  work,  it  is  of  practical  interest  to  consider  how  best  to  load-balance  and  possibly 
pipeline  the  algorithms  “Index-2-order-k-t”  and  “General”  on  a system  of  fewer  processors  than 
the  size  of  the  output  A,  for  the  purpose  of  minimizing  the  parallel  time  complexity. 
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